
@jayeshmahajan

Summary

This PR adds a new example demonstrating distributed training with PyTorch's Distributed Data Parallel (DDP) on Kubernetes. The example showcases multi-node, multi-GPU training using Kubernetes Jobs with comprehensive support for major cloud providers (GKE, EKS, AKS) and on-premises deployments.

What This Example Demonstrates

  • Distributed Data Parallel (DDP) Training: Multi-node, multi-GPU PyTorch training using DDP
  • Kubernetes Jobs with Indexed Completion: Coordinated parallel training workers using completionMode: Indexed
  • Pod-to-Pod Communication: Headless Services for stable DNS-based worker discovery
  • Persistent Storage: PVCs for training data and model checkpoints
  • Workload-Aware Scheduling: Integration with Kubernetes v1.35+ workload scheduling (optional)

Key Features

1. Distributed Training Setup

  • Uses PyTorch DDP for gradient synchronization across workers
  • Automatic rank assignment from Kubernetes Job completion index
  • Master worker discovery via headless Service DNS
  • DistributedSampler for data sharding across workers
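
How these pieces fit together is easiest to see in the Job spec itself. The following is a minimal sketch of the wiring — the resource names, worker count, image, and port are illustrative assumptions, not the exact contents of training-job.yaml:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-ddp                 # hypothetical name
spec:
  completions: 4                    # one completion per worker
  parallelism: 4                    # run all four workers concurrently
  completionMode: Indexed           # gives each pod a stable completion index 0..3
  template:
    metadata:
      labels:
        app: pytorch-ddp            # matched by the headless Service selector
    spec:
      subdomain: pytorch-ddp        # must equal the headless Service name for per-pod DNS
      restartPolicy: Never
      containers:
        - name: trainer
          image: pytorch/pytorch    # placeholder image
          command: ["python", "/workspace/train.py"]
          env:
            # DDP rank comes straight from the Job completion index,
            # published by the Job controller as a pod annotation.
            - name: RANK
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
            - name: WORLD_SIZE
              value: "4"
            # Pod index 0 is the DDP master, reachable via the headless Service.
            - name: MASTER_ADDR
              value: pytorch-ddp-0.pytorch-ddp
            - name: MASTER_PORT
              value: "29500"
```

With `completionMode: Indexed`, each pod's hostname is `<job-name>-<index>`, which is what makes the fixed `MASTER_ADDR` above resolvable.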

2. Kubernetes Resources

  • Job: Indexed completion mode for stable pod naming and rank assignment
  • Headless Service: Enables direct pod-to-pod communication
  • PersistentVolumeClaims: Separate volumes for training data and outputs
  • ConfigMaps: Training script and hyperparameters
  • Workload: Workload-aware scheduling support (Kubernetes v1.35+)
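
For instance, the headless Service piece can be as small as the sketch below (names and port are illustrative and must line up with the Job's pod labels and `subdomain`):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: pytorch-ddp        # referenced by `subdomain` in the Job's pod template
spec:
  clusterIP: None          # headless: DNS returns individual pod IPs, no load balancing
  selector:
    app: pytorch-ddp       # must match the labels on the Job's pods
  ports:
    - name: rendezvous
      port: 29500          # port used by torch.distributed process-group setup
```

With this in place, worker 0 is addressable as `pytorch-ddp-0.pytorch-ddp.<namespace>.svc.cluster.local`.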

3. Multi-Cloud and On-Premises Support

  • Base configuration: Generic setup that works across environments
  • Kustomize overlays: Provider-specific configurations for:
    • Google Kubernetes Engine (GKE)
    • Amazon Elastic Kubernetes Service (EKS)
    • Azure Kubernetes Service (AKS)
    • On-premises Kubernetes
  • Comprehensive comments explaining cloud-specific vs. generic configurations
  • Storage class guidance for different deployment scenarios
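
As a sketch of the overlay pattern (the layout and patch below are assumptions, not the PR's exact files), a provider overlay can stay as small as a single StorageClass patch over the shared base:

```yaml
# overlays/gke/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                       # the generic, provider-neutral configuration
patches:
  - target:
      kind: PersistentVolumeClaim    # retarget every PVC in the base
    patch: |-
      - op: add
        path: /spec/storageClassName
        value: standard-rwo          # GKE balanced persistent disk (ReadWriteOnce)
```

Deploying a given environment is then `kubectl apply -k overlays/gke`, and likewise for the other provider overlays.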

4. Training Script

  • Simple CNN model for CIFAR-10 classification
  • Automatic dataset download (CIFAR-10)
  • Checkpoint saving at each epoch
  • TensorBoard logging support
  • Proper DDP initialization and cleanup
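
On the cluster side, the script and its outputs are wired in through mounts. A sketch of the relevant pod-template fragment follows — the mount paths and object names are assumptions for illustration:

```yaml
# Fragment of the Job's pod template (illustrative names)
containers:
  - name: trainer
    volumeMounts:
      - name: training-script
        mountPath: /workspace      # train.py arrives here via the ConfigMap
      - name: data
        mountPath: /data           # CIFAR-10 is downloaded and persisted here
      - name: output
        mountPath: /output         # per-epoch checkpoints and TensorBoard logs
volumes:
  - name: training-script
    configMap:
      name: training-script        # from training-script-configmap.yaml
  - name: data
    persistentVolumeClaim:
      claimName: training-data     # from data-pvc.yaml
  - name: output
    persistentVolumeClaim:
      claimName: training-output   # from output-pvc.yaml
```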

Files Included

  • training-job.yaml - Main Kubernetes Job configuration
  • train.py - PyTorch DDP training script
  • training-script-configmap.yaml - Training script as ConfigMap
  • service.yaml - Headless Service for pod communication
  • data-pvc.yaml / output-pvc.yaml - Persistent storage for the training data and for checkpoints/outputs
  • train-config.yaml - Training hyperparameters
  • workload.yaml - Workload-aware scheduling configuration
  • kustomization.yaml - Kustomize base configuration
  • README.md - Comprehensive documentation
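
For orientation, train-config.yaml might hold hyperparameters along these lines (the keys and values below are hypothetical):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: train-config
data:
  EPOCHS: "10"
  BATCH_SIZE: "128"
  LEARNING_RATE: "0.01"
```

A ConfigMap like this can be surfaced to train.py as environment variables with `envFrom` on the container.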

…premises support

This PR generalizes the PyTorch distributed training example to support multiple cloud providers (GKE, EKS, AKS) and on-premises Kubernetes deployments. The changes make cloud-specific configurations explicit through comments while maintaining backward compatibility and adding clear guidance for different deployment environments.

Key changes:
- Added on-premises Kubernetes nodeSelector examples and reorganized cloud provider configurations
- Added comprehensive comments explaining storage access modes and StorageClass options
- Updated documentation to cover all major cloud providers and on-premises deployments equally
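
The style of commented guidance reads roughly like this — the label keys are real provider node labels, but the concrete values and class names are illustrative:

```yaml
# training-job.yaml (excerpt) -- uncomment the selector matching your environment
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-tesla-t4   # GKE GPU node pools
  # eks.amazonaws.com/nodegroup: gpu-nodes            # EKS managed node group
  # kubernetes.azure.com/agentpool: gpu               # AKS node pool
  # gpu: "true"                                       # on-prem: any custom node label

# data-pvc.yaml (excerpt) -- RWX lets all workers share one dataset volume
accessModes:
  - ReadWriteMany
# storageClassName: standard-rwx    # GKE (Filestore CSI)
# storageClassName: efs-sc          # EKS (EFS CSI; class name is user-defined)
# storageClassName: azurefile-csi   # AKS (Azure Files supports RWX)
# storageClassName: nfs-client      # on-prem (e.g. nfs-subdir-external-provisioner)
```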

Benefits:
- Multi-cloud support with clear guidance for GKE, EKS, AKS, and on-premises
- Better documentation with comprehensive comments
- Easier adoption with environment-specific configuration examples
- Backward compatible - all existing configurations remain functional
@k8s-ci-robot added the do-not-merge/work-in-progress label on Jan 26, 2026
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jayeshmahajan
Once this PR has been reviewed and has the lgtm label, please assign soltysh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/XXL and cncf-cla: yes labels on Jan 26, 2026